
Introduce timeout for hanging workspaces #605

Merged (4 commits, Sep 30, 2021)

Conversation

@amisevsk (Collaborator) commented Sep 21, 2021:

Note: this PR depends on #598 and so for now includes those changes as well; ignore all but the last three commits (i.e. 05c0441 to the end). Once #598 is merged, this commit will be a lot simpler.

What does this PR do?

Introduces config setting .workspace.startProgressTimeout that controls how long a workspace can be in the starting phase without progressing before it is automatically failed.

"Progressing" in our case means "updating workspace .status.conditions" -- if the last condition change (checked by lastTransitionTime) is more than idleTimeout ago, we assume the workspace has failed.

Default timeout value is 5 minutes, to allow for e.g. an incredibly slow image pull.
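As a rough illustration of that check (a sketch only -- the Condition type and startTimedOut helper below are hypothetical names, not the operator's actual code), the newest lastTransitionTime across the status conditions is compared against the configured timeout:

package main

import (
    "fmt"
    "time"
)

// Condition mirrors the parts of a DevWorkspace status condition relevant here.
type Condition struct {
    Type               string
    LastTransitionTime time.Time
}

// startTimedOut reports whether no condition has changed for longer than
// timeout, i.e. the workspace is no longer "progressing".
func startTimedOut(conditions []Condition, timeout time.Duration, now time.Time) bool {
    var latest time.Time
    for _, c := range conditions {
        if c.LastTransitionTime.After(latest) {
            latest = c.LastTransitionTime
        }
    }
    return !latest.IsZero() && now.Sub(latest) > timeout
}

func main() {
    conditions := []Condition{
        {Type: "DeploymentReady", LastTransitionTime: time.Now().Add(-6 * time.Minute)},
    }
    // With the default 5m timeout, a workspace whose last condition change was
    // 6 minutes ago is considered hung.
    fmt.Println(startTimedOut(conditions, 5*time.Minute, time.Now())) // true
}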

What issues does this PR fix or reference?

Closes #578

Is it tested? How?

  1. Create or update the DevWorkspaceOperatorConfig to ignore FailedScheduling events (and optionally set a shorter startProgressTimeout):
    kubectl patch dwoc devworkspace-operator-config --type merge -p \
      '{
        "config": {
          "workspace": {
            "ignoredUnrecoverableEvents": [
              "FailedScheduling"
            ],
            "startProgressTimeout": "30s"
          }
        }
      }'
  2. Create a DevWorkspace that cannot be scheduled:
    cat <<EOF | kubectl apply -f -
    kind: DevWorkspace
    apiVersion: workspace.devfile.io/v1alpha2
    metadata:
      name: test-devworkspace
    spec:
      started: true
      template:
        components:
          - name: tooling
            container:
              image: quay.io/wto/web-terminal-tooling
              args: ["tail", "-f", "/dev/null"]
              memoryRequest: 100Gi
              memoryLimit: 100Gi
    EOF
  3. Wait 30 seconds; workspace should fail to start.

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

Add setting workspace.startProgressTimeout to denote maximum duration
for any workspace phase before the workspace start is failed. Default
value is 5m

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
// a "Starting" phase without progressing before it is automatically failed.
// Duration should be specified in a format parseable by Go's time package, e.g.
// "15m", "20s", "1h30m", etc. If not specified, the default value of "5m" is used.
StartProgressTimeout string `json:"startProgressTimeout,omitempty"`
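Since the field is a plain string, the value has to be parsed with Go's time package, falling back to the 5-minute default when unset. A minimal sketch of that parsing (the helper name is hypothetical, not necessarily how the operator implements it):

package main

import (
    "fmt"
    "time"
)

const defaultStartProgressTimeout = 5 * time.Minute

// parseStartProgressTimeout converts the config string into a duration,
// returning the default for an empty value and an error for a malformed one.
func parseStartProgressTimeout(raw string) (time.Duration, error) {
    if raw == "" {
        return defaultStartProgressTimeout, nil
    }
    timeout, err := time.ParseDuration(raw)
    if err != nil {
        return defaultStartProgressTimeout, fmt.Errorf("invalid startProgressTimeout %q: %w", raw, err)
    }
    return timeout, nil
}

func main() {
    timeout, err := parseStartProgressTimeout("1h30m")
    fmt.Println(timeout, err) // 1h30m0s <nil>
}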
Contributor:

I wonder if this might make more sense to be something like StartPhaseTimeout? WDYT?

Collaborator (Author):

I renamed it to ProgressTimeout to include failing DevWorkspaces and forgot this comment -- sorry. Are you okay with that name? My thinking is that it's the maximum duration a DevWorkspace can exist without progressing.

@JPinkney (Contributor) left a comment:

Just tested and everything worked as expected

@sleshchenko (Member) left a comment:

LGTM. I just have some questions, which I may answer myself during testing, but I'm not there yet.

controllers/workspace/status.go (review thread; outdated, resolved)
@@ -355,6 +376,9 @@ func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request
}
reqLogger.Info("Waiting on deployment to be ready")
reconcileStatus.setConditionFalse(conditions.DeploymentReady, "Waiting for workspace deployment")
if !deploymentStatus.Requeue && deploymentStatus.Err == nil {
return reconcile.Result{RequeueAfter: startingWorkspaceRequeueInterval}, nil
Member:

Does the last or the first interval win here?

  • I mean, we wait for the deployment initially and schedule reconciling in 5 secs (1);
  • The deployment creates pods, which invokes the reconcile loop, and we schedule again in 5 secs (2);
  • The pod is updated several times as container statuses propagate, so we schedule N more times in 5 secs (N).

The question is: how many reconcile loops will our RequeueAfter initiate?

Collaborator (Author):

My understanding is that reconciling due to an event cancels out a requeueAfter, but it's hard to check for sure. In my testing, I don't see bursts of "waiting on deployment" in the logs, which is what would likely happen if we were stacking requeues.

@@ -59,6 +59,10 @@ import (
dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
)

const (
startingWorkspaceRequeueInterval = 5 * time.Second
Member:

I wonder if 5 secs is always the best choice, or whether we could make it dynamic, e.g. 1, 2, or 3 minutes in the case of a 5-minute timeout?

Collaborator (Author):

I mostly picked 5 seconds because it feels long enough that it's not burdening the controller (we can reconcile many times a second) but also short enough to avoid strange issues in setting a timeout (e.g. if we set a 1 minute requeue, what happens if the config specifies a timeout of 1 minute 15 seconds?)

Member:

> 1 minute 15 seconds

I don't think there is a use case for such a precise timeout. Maybe we can declare our precision to be something like 1 minute or 30 seconds.

And actually, I thought about reconciling after (timeout + last transition time - now + 5 sec), which would initiate a reconcile loop about 5 seconds after the point when we would potentially need to fail the workspace.

But I'm OK with any approach, as long as it does not generate redundant load.
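A minimal sketch of the dynamic requeue delay described above, i.e. (timeout + last transition time - now + 5 sec); the helper and values are illustrative, not code from this PR:

package main

import (
    "fmt"
    "time"
)

// requeueDelay schedules the next reconcile roughly 5 seconds after the
// progress timeout would expire, instead of polling every 5 seconds.
func requeueDelay(lastTransition time.Time, timeout time.Duration, now time.Time) time.Duration {
    delay := lastTransition.Add(timeout).Sub(now) + 5*time.Second
    if delay < 5*time.Second {
        // The timeout has already expired (or nearly so); reconcile soon.
        delay = 5 * time.Second
    }
    return delay
}

func main() {
    now := time.Now()
    lastTransition := now.Add(-2 * time.Minute) // example: last progress was 2 minutes ago
    fmt.Println(requeueDelay(lastTransition, 5*time.Minute, now)) // ~3m5s
}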

@amisevsk (Collaborator, Author) commented Sep 30, 2021:

I tested this informally by starting a workspace that will time out (ignoring FailedScheduling events and asking for 100Gi in Theia). While the workspace is looping on checking the deployment every 5 seconds, I ran

for i in {1..5}; do
  kubectl patch dw theia-next --type merge -p '{"metadata": {"labels": {"touch": "false"}}}'
  sleep 0.5
  kubectl patch dw theia-next --type merge -p '{"metadata": {"labels": {"touch": "true"}}}'
  sleep 0.5
done

to trigger 10 reconciles to the object, with the assumption that each of those would also start a RequeueAfter. However, after the 10 quick reconciles are completed, the controller goes back to queuing reconciles every five seconds, rather than 10 reconciles every 5 seconds, so it seems like RequeueAfter is cancelled if an event triggers a reconcile.

I agree that 5 seconds may be the wrong value here; we should tweak this later.

Rename startProgressTimeout to progressTimeout and repurpose it to
detect workspaces stuck in the "Failing" state for too long as well.

Signed-off-by: Angel Misevski <amisevsk@redhat.com>
@openshift-ci openshift-ci bot removed the lgtm label Sep 27, 2021
@sleshchenko (Member) left a comment:

I don't have any reason to delay delivering this valuable feature.
Approving; if I find any way to improve or optimize the implementation (if that's really needed), I'll raise a dedicated issue.

openshift-ci bot commented Sep 28, 2021:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, JPinkney, sleshchenko

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [JPinkney,amisevsk,sleshchenko]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sleshchenko (Member) commented:

/test v8-devworkspace-operator-e2e, v8-che-happy-path

openshift-ci bot commented Sep 28, 2021:

@amisevsk: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/v8-che-happy-path
Commit: 9479380
Required: true
Rerun command: /test v8-che-happy-path


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sleshchenko (Member) commented:

Che Happy path fails with an error (screenshot attached: Screenshot_20210928_180850).

I assume it's caused by a health check that treats a redirect as success.
So the health endpoint should be added to the allowlist on the Che side; it should not be a blocker for this PR.
But we could also make DWO accept only 2XX response codes and not follow redirect links -- that's another story, though.
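A minimal sketch of the stricter health check mentioned above -- accept only 2XX responses and never follow redirects, so a 302 to a login page counts as unhealthy. This illustrates the idea only; it is not DWO's actual health-check code, and the endpoint URL is hypothetical:

package main

import (
    "fmt"
    "net/http"
    "time"
)

// checkHealthy returns true only for 2XX responses; redirects are not followed.
func checkHealthy(url string) (bool, error) {
    client := &http.Client{
        Timeout: 5 * time.Second,
        // Stop at the first redirect response instead of following it.
        CheckRedirect: func(req *http.Request, via []*http.Request) error {
            return http.ErrUseLastResponse
        },
    }
    resp, err := client.Get(url)
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()
    return resp.StatusCode >= 200 && resp.StatusCode < 300, nil
}

func main() {
    healthy, err := checkHealthy("http://localhost:8080/healthz") // hypothetical endpoint
    fmt.Println(healthy, err)
}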

@Divine1 commented Jul 8, 2022:

Hi team,

I have installed Eclipse Che in a k8s cluster and I'm facing the timeout error below.
Please let me know how I can increase the timeout limit here.
Should I add StartProgressTimeout in devfile.yaml? Please help me with this.

[screenshot: workspace start timeout error]

I tried increasing the timeout in the CheCluster, but it had no effect.

apiVersion: org.eclipse.che/v2
kind: CheCluster
metadata:
  name: eclipse-che
spec:
  components:
    cheServer:
      extraProperties:
        CHE_INFRA_KUBERNETES_WORKSPACE__START__TIMEOUT__MIN: "15"
    database:
      externalDb: true
      postgresHostName: sc2-10-186-67-195.eng.vmware.com
      postgresPort: "5432"
  networking:
    auth:
      identityProviderURL: https://dexvmware.com
      oAuthClientName: eclipse-che
      oAuthSecret: sdsdsdsdsd

@sleshchenko @amisevsk @dmytro-ndp

@amisevsk (Collaborator, Author) commented Jul 8, 2022:

Hi @Divine1 -- that error will occur when the DevWorkspace does not reach a "Running" state before the start timeout (introduced in this PR). By default this is 5 minutes but can be increased by setting the .config.workspace.progressTimeout field in the DevWorkspaceOperatorConfig, e.g.:

apiVersion: controller.devfile.io/v1alpha1
kind: DevWorkspaceOperatorConfig
metadata:
  name: devworkspace-operator-config
  namespace: <devworkspace/che installation namespace>
config:
  workspace:
    progressTimeout: 15m # This sets the timeout to 15 minutes

However, if it's taking longer than 5 minutes for your workspace to start, that suggests a different problem. Do you have any information about the workspace or its pods while it's starting? Some things to check:

  • What phase does the workspace get stuck in (kubectl get dw <workspace-name>)?
  • Is the workspace deployment running/ready? If so, is there anything in its logs?

From here, we can start to narrow down what's causing the issue.

Feel free to open another issue in this repo and we can discuss there.

@tolusha (Contributor) commented Aug 18, 2022:

@amisevsk @Divine1
So, the problem is the following: it takes more than 5m to pull images, which causes the workspace to fail to start.
What are the possible solutions?

@Divine1 commented Aug 18, 2022:

@tolusha Thank you for checking this.

I have posted logs at the link below, which show the workspace failing to start within 5m. Also, is there any issue with the project-clone init container?

Detailed logs are in eclipse-che/che#21613 (comment)

@amisevsk

@Divine1 commented Aug 18, 2022:

@tolusha I'm able to see errors in the kubectl get events -n workspace-namespace output. If increasing the timeout is the solution, then there shouldn't be any other errors in the kubectl get events -n workspace-namespace output, right?

One more doubt: I noticed that the workspace deployment is applied immediately after the PVC is created, but provisioning a PV takes more time. Should there be any delay between applying the workspace deployment and PV provisioning?

@amisevsk (Collaborator, Author) commented:

If the issue is that pulling images takes longer than 5 minutes, then the solutions would be either

  1. Increase the timeout as specified in Introduce timeout for hanging workspaces #605 (comment) (I don't believe the dashboard is failing workspaces, and only the duration in the error is hard-coded)
  2. Install and configure the Kubernetes Image Puller in the cluster to automatically pull images before workspaces are started.
  3. Manually ensure every node in the cluster has the images required (the approach the image puller uses is to create a daemonset to run a pod with those images on every node)


Successfully merging this pull request may close these issues.

Introduce timing out for DevWorkspace to prevent hanging on any of startup phase
5 participants